
Conversation

@mertalev (Contributor) commented Dec 22, 2025

Pull Request Template

Checklist

  • Confirmed that cargo run-checks command has been executed.
  • Made sure the book is up to date with changes in this PR.

Changes

Adds an nms op to burn-vision for detection use cases. It matches the options of the NonMaxSuppression ONNX op to make it easy to support in later PRs.
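For illustration, here is a minimal usage sketch. The tensor method name, import paths, and the need for an extension trait are assumptions based on the NmsOptions struct and the "tensor op for ergonomics" commit in this PR, not a confirmed API:

// Hypothetical usage sketch: the nms method, the import paths, and the return
// type are assumptions; only the NmsOptions fields come from this PR.
use burn::backend::NdArray;
use burn::tensor::Tensor;
use burn_vision::NmsOptions; // a vision extension trait likely also needs to be in scope

type B = NdArray;

fn main() {
    let device = Default::default();

    // Three boxes in [x1, y1, x2, y2] format; the first two overlap heavily.
    let boxes = Tensor::<B, 2>::from_floats(
        [
            [0.0, 0.0, 10.0, 10.0],
            [1.0, 1.0, 11.0, 11.0],
            [50.0, 50.0, 60.0, 60.0],
        ],
        &device,
    );
    let scores = Tensor::<B, 1>::from_floats([0.9, 0.8, 0.7], &device);

    let options = NmsOptions {
        iou_threshold: 0.5,   // suppress boxes with IoU > 0.5 against a higher-scoring box
        score_threshold: 0.0, // keep all boxes regardless of score
        max_output_boxes: 0,  // 0 = unlimited
    };

    // Expected to keep boxes 0 and 2; box 1 is suppressed by box 0.
    let keep = boxes.nms(scores, options);
    println!("{keep:?}");
}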

Testing

I compared outputs from torchvision and this implementation and confirmed that they match with the same settings.

Note: I also wrote a CubeCL kernel for GPU acceleration, but try as I might, it was simply slower than the CPU SIMD implementation. The data size is not that large for most applications, and part of the algorithm is inherently sequential. The best I could get was ~70% slower than CPU for 800 boxes, 16% slower for 3,200, and about the same for 12,800. I decided to omit it from the PR because it's just faster to do NMS on CPU and transfer the result back. Maybe there are use cases or scenarios I'm not considering.

Commits

  • cubecl impl
  • block-parallel
  • bitmask-parallel
  • two phase
  • optimize for cpu
  • formatting
  • simplify response
  • remove kernel
  • refactor
  • tensor op for ergonomics
  • linting, tests

@codecov (bot) commented Dec 22, 2025

Codecov Report

❌ Patch coverage is 93.10345% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.88%. Comparing base (9f31281) to head (58df53d).

Files with missing lines                    Patch %  Lines
crates/burn-vision/src/backends/cpu/nms.rs  94.20%   8 Missing ⚠️
crates/burn-vision/src/ops/base.rs          56.25%   7 Missing ⚠️
crates/burn-vision/src/tensor.rs            83.33%   1 Missing ⚠️

❌ Your project check has failed because the head coverage (68.88%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4246      +/-   ##
==========================================
+ Coverage   68.85%   68.88%   +0.03%     
==========================================
  Files        1405     1407       +2     
  Lines      167686   167918     +232     
==========================================
+ Hits       115456   115669     +213     
- Misses      52230    52249      +19     

☔ View full report in Codecov by Sentry.

@AdrianEddy commented

For context: I had a use case for GPU-based NMS where I needed to run my entire pipeline on the GPU and avoid any CPU-GPU data transfers except the initial JPEG bytes and the final output. That pipeline ran entirely on the GPU, and I wrote the NMS in CUDA to achieve that; it was much faster than doing it on the CPU, mainly because it avoided data transfers altogether.

Did you consider the case where the source and output data is already on the GPU in your benchmarks? I think the data transfer itself might be the bottleneck here: when the source data is on the CPU, the CPU version is faster just because of that transfer.

I'm not saying that's necessarily what happened, just that a GPU-only NMS may still be useful; I have used one myself.

I believe having both versions available is ideal here, so users can choose whichever fits their use case best.

@mertalev (Contributor, Author) commented

The data was on the device being tested at the start, no transfers included. I didn't test the time to transfer the NMS input to CPU and back, though.

I can include the kernel for completeness. I only tested a specific scenario on Apple Silicon, so maybe it's worth it in other cases. I'm sure the kernel has room for optimization as well.

@AdrianEddy commented Dec 25, 2025

One additional factor to consider here is that data transfers are much faster on Apple Silicon because of unified memory. On Linux/CUDA the transfer itself may have a much bigger overhead because the GPU is a separate device.

So in that case, even if the NMS algorithm itself is potentially slower on the GPU, it may still be beneficial to run the slower algorithm on device and avoid the data transfer, rather than run a faster algorithm on the CPU and pay for the transfer.

@antimora requested a review from wingertge December 27, 2025 17:58

@laggui (Member) left a comment

Great job! Thanks for the accelerated NMS implementation.

Just some minor comments below.

For a GPU implementation, I think the benefits typically come with a batched NMS implementation, which involves some tricks. Most commonly, the box coordinates are offset per class so that NMS is applied to each class independently (ref: torchvision batched_nms).
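For reference, here is a rough standalone sketch of that class-offset trick (not part of this PR; the function and the plain-NMS callback are hypothetical). Shifting each class's boxes into a disjoint coordinate range means a single single-class NMS pass never compares boxes across classes:

// Sketch of the class-offset trick used by batched NMS (hypothetical helper,
// not code from this PR): offset every box by class_id * (max_coordinate + 1)
// so boxes of different classes can never overlap, then run plain NMS once.
fn batched_nms_sketch(
    boxes: &[[f32; 4]], // [x1, y1, x2, y2] per box
    scores: &[f32],
    class_ids: &[usize],
    nms: impl Fn(&[[f32; 4]], &[f32]) -> Vec<usize>, // plain single-class NMS returning kept indices
) -> Vec<usize> {
    // The per-class offset must exceed the largest coordinate of any box.
    let max_coord = boxes
        .iter()
        .flat_map(|b| b.iter().copied())
        .fold(0.0f32, f32::max);

    let offset_boxes: Vec<[f32; 4]> = boxes
        .iter()
        .zip(class_ids)
        .map(|(b, &c)| {
            let off = c as f32 * (max_coord + 1.0);
            [b[0] + off, b[1] + off, b[2] + off, b[3] + off]
        })
        .collect();

    // Boxes of different classes are now disjoint (IoU = 0 across classes),
    // so plain NMS only suppresses within each class.
    nms(&offset_boxes, scores)
}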

Comment on lines 44 to 52
// Filter by score threshold and convert to SoA format in one pass
let score_thresh = options.score_threshold;
let mut x1s = Vec::with_capacity(n_boxes);
let mut y1s = Vec::with_capacity(n_boxes);
let mut x2s = Vec::with_capacity(n_boxes);
let mut y2s = Vec::with_capacity(n_boxes);
let mut areas = Vec::with_capacity(n_boxes);
let mut filtered_scores = Vec::with_capacity(n_boxes);
let mut original_indices = Vec::with_capacity(n_boxes);
@laggui (Member) commented

Instead of allocating 5 separate Vec<f32>s, it would likely be better to keep a single vec of filtered indices and use one contiguous "scratchpad" for the filtered box data, reducing this to a single heap allocation:

// After filtering and sorting, we know the number of filtered boxes `n_filtered`
let mut scratchpad = vec![0.0f32; n_filtered * 5];

// Split into slices
let (x1s, rest) = scratchpad.split_at_mut(n_filtered);
let (y1s, rest) = rest.split_at_mut(n_filtered);
let (x2s, rest) = rest.split_at_mut(n_filtered);
let (y2s, areas) = rest.split_at_mut(n_filtered);

for (i, b) in filtered.iter().enumerate() {
    // Fill the slices with the filtered box coordinates and areas
}

And the previous filtering can simply be applied on the indices instead:

let mut filtered_indices = Vec::with_capacity(n_boxes);

for i in 0..n_boxes {
    let score = scores_vec[i];
    if score >= options.score_threshold {
        filtered_indices.push(i); // original index
    }
}

let n_filtered = filtered_indices.len();
if n_filtered == 0 {
    return vec![];
}

// Note: I used total_cmp instead (which should handle NaNs instead of panicking)
filtered_indices.sort_by(|&a, &b| scores_vec[b].total_cmp(&scores_vec[a]));

This also makes it easier to implement aligned loads in the future if desired. Even without that, the single contiguous allocation should already be an improvement, I believe.

You could benchmark both implementations to see if it's actually worth it.

@mertalev (Contributor, Author) commented Jan 7, 2026

It made for a nice improvement!

Unaligned:

Benchmark       Burn Version  Shapes               Feature  Backend  Device  Median
nms-n100-f32    local         [(100, 4)(100)]      ndarray  ndarray  Cpu     17.000µs
nms-n1000-f32   local         [(1000, 4)(1000)]    ndarray  ndarray  Cpu     438.000µs
nms-n10000-f32  local         [(10000, 4)(10000)]  ndarray  ndarray  Cpu     21.385ms
nms-n5000-f32   local         [(5000, 4)(5000)]    ndarray  ndarray  Cpu     6.785ms

Aligned:

Benchmark       Burn Version  Shapes               Feature  Backend  Device  Median
nms-n100-f32    local         [(100, 4)(100)]      ndarray  ndarray  Cpu     16.000µs
nms-n1000-f32   local         [(1000, 4)(1000)]    ndarray  ndarray  Cpu     373.000µs
nms-n10000-f32  local         [(10000, 4)(10000)]  ndarray  ndarray  Cpu     17.337ms
nms-n5000-f32   local         [(5000, 4)(5000)]    ndarray  ndarray  Cpu     5.460ms

With all_suppressed optimization:

Benchmark       Burn Version  Shapes               Feature  Backend  Device  Median
nms-n100-f32    local         [(100, 4)(100)]      ndarray  ndarray  Cpu     15.000µs
nms-n1000-f32   local         [(1000, 4)(1000)]    ndarray  ndarray  Cpu     276.000µs
nms-n10000-f32  local         [(10000, 4)(10000)]  ndarray  ndarray  Cpu     11.933ms
nms-n5000-f32   local         [(5000, 4)(5000)]    ndarray  ndarray  Cpu     3.692ms
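For context, all_suppressed is assumed here to mean an early exit that skips a whole SIMD-width chunk of candidate boxes once every box in it has been suppressed; the wide-integer check reviewed further below is its fast path. A scalar sketch of the idea, not the actual code:

// Scalar sketch of the all_suppressed early exit (an assumption drawn from the
// lanes-based check reviewed below, not the merged code): if every box in the
// current chunk is already suppressed, the IoU work for that chunk can be skipped.
fn chunk_all_suppressed(suppressed: &[bool], start: usize, lanes: usize) -> bool {
    suppressed[start..start + lanes].iter().all(|&s| s)
}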

@laggui (Member) commented Jan 7, 2026

Thanks for reporting the benchmarks! That's awesome.

Assuming the "unaligned" results are from the previous version, I think the biggest jump in "aligned" likely comes from the single vec allocation (for reads). But the aligned vec is a nice addition on top.

Comment on lines +166 to +177
/// Non-Maximum Suppression options.
#[derive(Clone, Copy, Debug)]
pub struct NmsOptions {
/// IoU threshold for suppression (default: 0.5).
/// Boxes with IoU > threshold with a higher-scoring box are suppressed.
pub iou_threshold: f32,
/// Score threshold to filter boxes before NMS (default: 0.0, i.e., no filtering).
/// Boxes with score < score_threshold are discarded.
pub score_threshold: f32,
/// Maximum number of boxes to keep (0 = unlimited).
pub max_output_boxes: usize,
}
@laggui (Member) commented

Future note: it might be useful to extend the box format to support [cx, cy, w, h] in addition to the current [x1, y1, x2, y2].
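The conversion itself would be a one-liner per box; a sketch, not part of this PR:

// Convert a [cx, cy, w, h] box to the [x1, y1, x2, y2] format used by this PR.
fn cxcywh_to_xyxy([cx, cy, w, h]: [f32; 4]) -> [f32; 4] {
    [cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h]
}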

@mertalev (Contributor, Author) commented Jan 7, 2026

> Great job! Thanks for the accelerated NMS implementation.
>
> Just some minor comments below.
>
> For a GPU implementation, I think the benefits typically come with a batched NMS implementation, which involves some tricks. Most commonly, the box coordinates are offset per class so that NMS is applied to each class independently (ref: torchvision batched_nms).

Thanks! After squeezing out some more performance, I'm happy with the state of the PR, but I think it's probably worthwhile to circle back to the kernel in the future, possibly when adding batch_nms as you say.

@laggui (Member) left a comment

Thanks for addressing my comments!

LGTM, pending one small change I believe is still required.

Approving in advance nonetheless.

Comment on lines +162 to +170
match lanes {
4 => *(suppressed.as_ptr().add(i) as *const u32) == 0x01010101,
8 => *(suppressed.as_ptr().add(i) as *const u64) == 0x0101010101010101,
16 => {
*(suppressed.as_ptr().add(i) as *const u128)
== 0x01010101010101010101010101010101
}
_ => unreachable!(),
}
@laggui (Member) commented

This casts a *const bool (1-byte aligned) to a *const u32/u64/u128 without guaranteeing proper alignment. In practice I think it will work on modern platforms, which typically over-align allocations (as I've discovered in the past), but it's still technically UB.

I think an easy fix would be to simply read unaligned, e.g.

suppressed.as_ptr().add(i).cast::<u32>().read_unaligned()
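Applied to the match above, the suggestion would look roughly like this (a sketch, still inside the existing unsafe context):

// Same check with unaligned reads: avoids the alignment UB while keeping the
// wide comparison over the lanes suppression flags.
match lanes {
    4 => suppressed.as_ptr().add(i).cast::<u32>().read_unaligned() == 0x01010101,
    8 => suppressed.as_ptr().add(i).cast::<u64>().read_unaligned() == 0x0101010101010101,
    16 => {
        suppressed.as_ptr().add(i).cast::<u128>().read_unaligned()
            == 0x01010101010101010101010101010101
    }
    _ => unreachable!(),
}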

